Effectiveness of Register Preloading on CP-PACS Node Processor

Authors

  • Hiroshi Nakamura
  • Ken'ichi Itakura
  • Masazumi Matsubara
  • Taisuke Boku
  • Kisaburo Nakazawa
Abstract

CP-PACS is a massively parallel processor (MPP) for large-scale scientific computations. In September 1996, CP-PACS, equipped with 2048 processors, began operation at the University of Tsukuba. At that time, CP-PACS was the fastest MPP in the world on the LINPACK benchmark. CP-PACS was designed to achieve very high performance in large scientific/engineering applications. It is well known that an ordinary data cache is not effective in such applications because the data size is much larger than the cache size and because there is little temporal locality. Thus, a special mechanism for hiding long memory access latency is indispensable. Cache prefetching is a well-known technique for this purpose. In addition to cache prefetching, CP-PACS node processors implement a register preloading mechanism. This mechanism enables the processor to transfer required floating-point data directly (not via the data cache) between main memory and floating-point registers in a pipelined way. We compare register preloading with cache prefetching by measuring the real performance of the CP-PACS processor and the HP PA-8000 processor, which implement cache prefetching and/or register preloading. The results revealed the following points. First, register preloading outperforms cache prefetching in consecutive data access. This is because cache prefetching suffers from a lack of memory/cache throughput. Second, the superiority of register preloading over cache prefetching increases significantly in stride access. This is because cache prefetching transfers not only the required data but also unnecessary data in the cache line, which wastes memory/cache throughput. Third, register preloading is also effective in indirectly indexed vector computations. Fourth, in multidimensional applications, register preloading achieves more stable and better performance than cache-based computations, even when blocking optimization is used in the cache-based computations.
From these results, it is concluded that register preloading is superior to cache prefetching for large numerical applications.
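
The stride-access argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the paper: the function names, the 64-byte cache line, and the 8-byte element size are illustrative assumptions, not parameters of the CP-PACS or PA-8000 processors. It counts how many bytes a line-granular cache fetch moves during a strided sweep and compares that with the bytes the computation actually needs, which is the amount a mechanism that loads only the required data would transfer.

```python
def line_traffic(n_elems, stride, elem_bytes=8, line_bytes=64):
    """Return (useful_bytes, transferred_bytes) for a strided sweep.

    elem_bytes and line_bytes are assumed values for illustration only;
    they are not taken from the CP-PACS or PA-8000 hardware.
    """
    touched_lines = set()
    for i in range(n_elems):
        addr = i * stride * elem_bytes          # byte address of element i
        touched_lines.add(addr // line_bytes)   # cache line holding it
    useful = n_elems * elem_bytes               # bytes the loop really needs
    transferred = len(touched_lines) * line_bytes  # whole lines are fetched
    return useful, transferred


def efficiency(n_elems, stride, elem_bytes=8, line_bytes=64):
    """Fraction of fetched bytes that the computation actually uses."""
    useful, transferred = line_traffic(n_elems, stride, elem_bytes, line_bytes)
    return useful / transferred


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 16):
        print(f"stride {stride:2d}: useful fraction = {efficiency(1024, stride):.3f}")
```

Under these assumptions the useful fraction falls from 1.0 at unit stride to 0.125 once the stride reaches one full line, which is the throughput waste the abstract attributes to cache prefetching. The model ignores latency, associativity, and pipelining, so it says nothing about the measured results themselves.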


Similar resources

Tolerating First Level Memory Access Latency in High-Performance Systems

In order to improve performance, future parallel systems will continue to increase the processing power of each node in a system. As node processors, though, can execute more instructions concurrently, they become more sensitive to the first level memory access latency. This paper presents a set of hardware and software techniques, collectively referred to as register preloading, to effectively t...


TOLERATING FIRST LEVEL MEMORY ACCESS LATENCY IN HIGH-PERFORMANCE SYSTEMS

In order to improve performance, future parallel systems will continue to increase the processing power of each node in a system. As node processors, though, can execute more instructions concurrently, they become more sensitive to the first level memory access latency. This paper presents a set of hardware and software techniques, collectively referred to as register preloading, to effectively ...


CP-PACS: A massively parallel processor for large scale scientific calculations

CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processor with 2048 processing units built at the Center for Computational Physics, University of Tsukuba. It has an MIMD architecture with a distributed memory system. The node processor of CP-PACS is a RISC microprocessor enhanced by the Pseudo Vector Processing feature, which can realize high-performance vector pr...


Performance Improvement for Matrix Calculation on CP-PACS Node Processor

CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processing system with 2048 node processors for large scale scientific calculations. On a node processor of CP-PACS, there is a special hardware feature called PVPSW (Pseudo Vector Processor based on Slide Window), which realizes efficient vector processing on a superscalar processor without depending on th...


Performance evaluation of CP-PACS on CG benchmark

In this research, we evaluate the NAS Parallel Benchmarks ver. 1 kernel CG on the massively parallel processor CP-PACS and analyze the result. The CP-PACS CPU has a special register that is auto-incremented every clock cycle, so we can measure the time spent in any function routine with very high accuracy. As a result of performance analysis, especially for data transfer time, our desk-top estimation fits to...



Publication date: 2007